ABSTRACT

Conventionally, reliability analyses either assume that a fault/error is 
detected immediately following its occurrence, or neglect damages caused by 
latent errors. Though unrealistic, this assumption has been imposed in order to 
avoid the difficulty of determining the respective probabilities that a fault 
induces an error and the error is then detected in a random amount of time
after its occurrence. 

As a remedy for this problem, in this paper a model is proposed to analyze 
the impact of error detection on computer performance under moderate 
assumptions. Error latency - the time interval between occurrence of an error 
and the moment of error detection - is used to measure the effectiveness of a 
detection mechanism. We have used this model to (1) predict the probability of
producing an unreliable result, and (2) estimate the loss of computation due to 
fault and/or error. 



The work reported here is supported in part by NASA grant No. NAG 1-296. All correspondence
should be addressed to Prof. K. G. Shin, ECE Dept., The University of Michigan, Ann Arbor, MI 48109.




1. INTRODUCTION 

During the past decade, many reliability-related models for fault-tolerant
computers have been developed. Based on system structures and operation
strategies, these models predict various measures such as reliability,
computation capacity, performability, etc. Usually, in these models, a
probability distribution function is used to describe the occurrence of
component or system failure. The results represent the time-varying
characteristics of a computer system. Since only the occurrence of failure is
included in these models, they fail to cover the following two aspects. One is
the existence of a latent fault, in which case a fault is present but no
erroneous state is induced. The other is the possible error latency, because an
error may not be detected immediately following its occurrence.

Consider the property of a fault. An input signal to a computer may cause
the fault to induce some erroneous states, or it may simply pass through this
fault and produce a correct output. The fault is said to be latent if it does not
harm normal operations. Bavuso, et al., investigated the problem of latent fault
and proposed experiments to measure fault latency [1]. Their studies indicate
that a significant proportion of faults remained latent after many repetitions of
a program. This fault latency has an important impact on an ultra-reliable
system since it may cause a catastrophe if more than one latent fault becomes
active.

It is desired that the error detection mechanisms associated with the
system identify an error immediately upon generation. In fact, some errors may
not be captured by error detection mechanisms when they occur, and then spread
as a result of the subsequent flow of information. Thus the damage by the error
will be propagated until it is identified. The delay between the occurrence of
an error and the moment of detection, called error latency, is important to
damage assessment, error recovery, and confidence in computed results. The same
notion has been defined by Courtois as detection time [2,3] and by Shedletsky as
latency difference [4]. Courtois also showed the results of on-line testing of the
M6800 microprocessor and presented the distributions of detection time for
certain detection mechanisms. Shedletsky proposed a technique to evaluate the
error latency based on the "fault set" philosophy and the probability distribution
of input signals. The resultant error latency was used to decide the required
rollback distance for successful data restoration. Both of these are confined to
the study of error detection capability and are not extended to include the
impacts of error detection on the system performance.

In reconfigurable fault-tolerant systems, a task executed by processors can
be recovered through various recovery methods if one of the resident
processors fails. Thus these systems are considered to have failed only when all
resources are exhausted or the system fails to reconfigure. In practice, in
addition to the probability of system failure, one may question what would
happen if the system cannot respond to a fault/error immediately following its
occurrence. With the existence of error latency, the system may send out some
erroneous computation results if it is still unaware of the error at the output
phase. On the other hand, even if the system has detected the error before it is
propagated, the computation achieved during error latency is useless and the
whole system suffers from the delay caused by error latency and recovery. So
the total cost induced by fault and/or error consists of two parts: one is the
computation loss, which includes error detection overhead, error latency, and
recovery overhead; the other is the relative cost increase due to delayed
response. To quantify these effects of error latency, the probability of having an
unreliable result and the computation loss have to be evaluated.

Various error detection techniques can be used to reduce the computation
loss and enhance the reliability of computation results, for instance,
enhancement of self-checking capability so that most errors can be detected
immediately, limitation of the error contamination to reduce error latency and
recovery overhead, and periodic diagnostics which can seize faults before they
induce errors. Each of these techniques by itself may not provide acceptable
solutions to the reliability problem without high cost or overhead. Instead, a
combination of these techniques must be employed to obtain a good, reliable
performance at a reasonable cost.

In this paper, a model is proposed to describe error detection processes
and to estimate their influences on system performance. Because intermittent
faults can seriously degrade performance and can cause a large fraction of all
errors [5], the model is intended to study their impact. In the following section,
the classification and the properties of error detection mechanisms are
discussed first. The model is developed in Section 3. Section 4 presents the
evaluation of the probability of having an unreliable result and the estimation of
average computation loss. The optimal strategy of periodic diagnostic is also
discussed in this section. A brief conclusion follows in Section 5.

Note that in this paper we consider faults in hardware components which
may cause a transition to erroneous states during the normal operation. We
also assume that there is no design fault in the system. An error is defined to be
the consequent erroneous information/data caused by fault(s).


2. CLASSIFICATION OF ERROR DETECTION MECHANISMS 

There are various error detection mechanisms which can be incorporated
in a computer system. The basic principle of these mechanisms is the use of
redundancy in devices, information, or time. Based on where these mechanisms
are employed and their respective performance measures, they are divided into
the following three classes.

1. Signal level detection mechanisms 

Usually, the mechanisms in this category are implemented by built-in
self-checking circuits. Whenever an error is caused by a predescribed fault, these
circuits detect the malfunction immediately even if the erroneous signal does
not have any logical meaning. Typical methods include error detection codes,
duplicated complementary circuits, matchers, etc. Since the error is induced
only when an input signal falls into the corresponding fault set of the fault, the
fault latency will depend on the type of fault and the distribution of input signal.
On the other hand, the error is detected immediately whenever it is generated.
It is difficult for these detection mechanisms to achieve complete detectability
for all kinds of error because (1) it is prohibitively expensive to design detection
mechanisms which cover all types of faults, and (2) physical dependence
between function units and detection mechanisms cannot be totally avoided. The
performance of these detection mechanisms is measured by "coverage", which
is the probability of detecting an arbitrary fault.

2. Function level detection mechanisms 

The detection mechanisms in this level are intended to check for
unacceptable activities or information at a higher level than the previous
category. These detection mechanisms could be imagined as "barriers" around the
normal operations. After an error is generated by a fault, the resulting
abnormality may grow very quickly, which is called the "snow ball effect" [3] or
the "error rate phenomenon" [6], until it hits these barriers. We can apply
several software and hardware techniques such as capability checking,
acceptance checking, invalid op-code, timeout, and the like. Compared with the
mechanisms in the first category, these detection mechanisms are more flexible
and inexpensive but the error latency tends to increase. The effectiveness of
these detection mechanisms is very difficult to evaluate since it depends not
only on the program executed and the current system states, but also on the
type of error.

3. Periodic diagnostic 

This method is usually referred to as off-line testing because the
computation unit cannot perform any useful task while it is applied. It is
composed of a diagnostic program which imitates inputs such that all existing
faults are activated and thus produce errors. Several theoretic approaches exist
to determine the probability of finding an error after a certain amount of test
time (equivalent to the probability of detecting fault in this case) [7,8]. Tasar
also provided a simulation to show the coverage of a self-testing program [9]. All
these results indicate that the effectiveness of the present category is a
monotonically increasing function of run time. Since the time required for
complete testing is too long, it is impractical to apply this method frequently
during normal operation. An alternative is to perform an imperfect diagnostic
periodically during normal operation or perform a thorough diagnostic when the
system is idle.


3. MODEL DEVELOPMENT


We have developed a model for describing error detection mechanisms as in
Figure 2. The model consists of three parts: the occurrence of a fault, the
consequent generation of an error, and the detection of that error. Since the
probability of having a double fault at a time is negligible, the case of multiple
faults is excluded from this model. There are six states in the model as follows:

1). NF (non-faulty): In this state no fault exists in the system.

2). F (faulty): There is a fault which is active and capable of inducing
errors.

3). FB (fault-benign): There is an inactive intermittent fault.

4). E (error): There is at least one error in the system and the fault which
has yielded erroneous information is still present.

5). ENF (error-non-faulty): In this state the intermittent fault has become
inactive after it induced some errors.

6). D (detection): In this state, the detection mechanisms have
identified some errors in the system.

Usually, the occurrence of a fault is regarded as a Poisson process with
rate $\lambda$. Since the system may contain an inactive intermittent fault, a
benign state has to be included in the model. Several models of intermittent
faults were proposed and used for testing and reliability evaluation [10-15]. In
our model, the transitions between states NF, F, FB, and between states E, ENF
are used to describe the behavior of faults.

Suppose there exists a process for generating errors by a given fault. With
the assumption that the signal patterns in successive inputs are independent,
the period of fault latency can be considered to be a random variable with a
hyperexponential distribution (or composite geometric distribution for discrete
inputs or cycles [8]). Using the concepts of information theory, Agrawal
presented a formula to estimate the probability of inducing error [16]. In fact,
because of the memory of sequential circuits and the dependence of execution
sequence, the assumption of independent successive inputs is invalid. In our
model, an exponentially distributed fault latency with rate $\alpha$ is assumed
for simplicity. Since fault latency is generally much smaller than the life cycle,
this assumption would not degrade the accuracy of the whole model. Before an
error is induced by a fault, the system may transfer immediately into state D if
signal level detection mechanisms cover this fault with probability one.
Otherwise, the system enters state E. Another cause of a direct transition from
state F to state D is the execution of a diagnostic program. The transition
duration from state F to state D is assumed to be exponentially distributed with
parameter $u$ while the diagnostic program is running.

Once the system enters state E, the erroneous information starts to spread
until function level detection mechanisms identify any unacceptable result.
There are two paths to indicate this detection, which have transition rates
$\beta$ and $\gamma$, respectively. $\beta$ should be greater than $\gamma$ since
the existing fault in state E could induce more errors which may spread with
high probability. In addition, the execution of a diagnostic program can also
expose the fault in state E.
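
As a rough illustration of how the two detection paths combine, the following sketch computes the mean error latency when no diagnostic is run, so that all rates are constant. It assumes that the fault in state E becomes inactive at rate mu (moving the system to ENF) and becomes active again at rate nu, as in the transition matrix of Section 4.1. Writing m_E and m_ENF for the mean time to detection starting from E and ENF, first-step analysis gives m_E = 1/(mu+beta) + mu/(mu+beta) m_ENF and m_ENF = 1/(nu+gamma) + nu/(nu+gamma) m_E. The function name and the numerical values are illustrative assumptions, not part of the model definition.

```python
def mean_error_latency(mu, nu, beta, gamma):
    """Mean time from states E and ENF to detection, with constant rates.

    Solves the two first-step equations
        m_E   = 1/(mu+beta)  + mu/(mu+beta)  * m_ENF
        m_ENF = 1/(nu+gamma) + nu/(nu+gamma) * m_E
    by substitution. Requires beta > 0 or gamma > 0.
    """
    a = 1.0 / (mu + beta)      # mean sojourn time in E
    b = mu / (mu + beta)       # probability of leaving E towards ENF
    c = 1.0 / (nu + gamma)     # mean sojourn time in ENF
    d = nu / (nu + gamma)      # probability of leaving ENF back to E
    m_e = (a + b * c) / (1.0 - b * d)
    m_enf = c + d * m_e
    return m_e, m_enf


# Hypothetical rates, chosen only for illustration.
m_e, m_enf = mean_error_latency(mu=0.2, nu=0.2, beta=0.2, gamma=0.1)
print(m_e, m_enf)
```

With these rates the mean latency starting from ENF is larger than from E, which reflects the requirement that $\beta$ exceed $\gamma$.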

The model as described above is very general for covering the processes of
error detection. The transition rates are dependent upon 1). the error detection
mechanisms employed in the system, 2). the operations executed in the system,
and 3). the characteristics of the concerned physical devices.


4. EVALUATION OF THE IMPACT OF ERROR LATENCY 

4.1. Formulation of detection processes 

Let a computer system incorporate the three types of error detection
mechanisms discussed above. We are interested in both the useful computation
time before the detection of error and the consequence of error latency. The
diagnostic program is executed for period $t_p$ after every normal operation
period $t_n$ as shown in Figure 3. Thus the coverage of a single diagnostic,
denoted as $\xi$, is equal to $1-e^{-u t_p}$ for each diagnostic period. The
overhead for swapping between the normal operation and the diagnostic is
denoted by $t_v$. The signal level detection provides a coverage $c$ to detect
error immediately. If the function level detection mechanism finds an error, the
system may apply one of various recovery methods to rescue the contaminated
message and computation. The recovery overhead is assumed to be a function of
error latency, denoted as $R(t_{el})$ where $t_{el}$ is error latency.
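
The relation between the length of a diagnostic and its coverage can be sketched as follows, assuming the coverage has the exponential form 1 - exp(-u t_p) implied by an exponentially distributed detection time with parameter u. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def diagnostic_coverage(u, t_p):
    """Coverage xi = 1 - exp(-u * t_p) of one diagnostic of length t_p."""
    return 1.0 - math.exp(-u * t_p)


def diagnostic_length(u, xi):
    """Diagnostic period t_p needed to reach coverage xi, with 0 <= xi < 1."""
    return -math.log(1.0 - xi) / u


# Illustrative values: detection rate u = 50 and a target coverage of 0.9.
t_p = diagnostic_length(u=50.0, xi=0.9)
print(t_p, diagnostic_coverage(50.0, t_p))
```

Because the coverage saturates exponentially, each additional increment of coverage costs more diagnostic time than the previous one.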

Since a latent fault merely has the potential to harm system behavior, we
deal only with the error latency instead of the fault latency. Note that there is
an absorbing state (Detection state) in our Markov model. To distinguish whether
the error latency exists or not, we divide the state D into state $D_1$ and state
$D_2$, where the transition to state $D_1$ has to go through state E, and state
$D_2$ is reachable directly from state F if the fault is captured before the
occurrence of error or an error is detected immediately when it occurs. For
convenience, let us number the states NF, F, FB, E, ENF, $D_1$, $D_2$ with $i$
for $i = 1, 2, \ldots, 7$ and define the transition matrix $H_{7\times 7}(t)$ as
follows:



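$$
H_{7\times 7}(t) =
\begin{pmatrix}
-\lambda & \dfrac{\lambda\nu}{\mu+\nu} & \dfrac{\lambda\mu}{\mu+\nu} & 0 & 0 & 0 & 0 \\
0 & -(\mu+\alpha_1(t)+\alpha_2(t)) & \mu & \alpha_1(t) & 0 & 0 & \alpha_2(t) \\
0 & \nu & -\nu & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -(\mu+\beta(t)) & \mu & \beta(t) & 0 \\
0 & 0 & 0 & \nu & -(\nu+\gamma(t)) & \gamma(t) & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}
$$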
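
A minimal sketch of this matrix in code, under the reading that it is a transition-rate (generator) matrix whose rows sum to zero, with states ordered NF, F, FB, E, ENF and the two detection states. The entry for the transition from E to ENF is assumed to be mu, and the rate values passed in below are illustrative only.

```python
def generator(lam, mu, nu, a1, a2, beta, gamma):
    """7x7 rate matrix for states NF, F, FB, E, ENF, D1, D2 (rows sum to zero).

    a1 is the rate F -> E, a2 the rate F -> D2 (direct detection);
    beta and gamma are the detection rates out of E and ENF.
    """
    s = mu + nu
    return [
        [-lam, lam * nu / s, lam * mu / s, 0.0, 0.0, 0.0, 0.0],
        [0.0, -(mu + a1 + a2), mu, a1, 0.0, 0.0, a2],
        [0.0, nu, -nu, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, -(mu + beta), mu, beta, 0.0],  # E -> ENF at rate mu (assumed)
        [0.0, 0.0, 0.0, nu, -(nu + gamma), gamma, 0.0],
        [0.0] * 7,                                     # D1 is absorbing
        [0.0] * 7,                                     # D2 is absorbing
    ]


# Illustrative normal-operation rates: a1 = (1 - c) * alpha, a2 = c * alpha.
alpha, c = 0.05, 0.6
H = generator(lam=1e-6, mu=0.2, nu=0.2,
              a1=(1 - c) * alpha, a2=c * alpha, beta=0.2, gamma=0.1)
row_sums = [sum(row) for row in H]
print(row_sums)
```

Checking that every row sums to zero is a cheap guard against transcription errors in the off-diagonal rates.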


Since the diagnostics are invoked periodically, transition rates $\alpha_1(t)$,
$\alpha_2(t)$, $\beta(t)$, and $\gamma(t)$ are functions of time which are defined
as follows, for $n = 0, 1, 2, \ldots$:

$$
\alpha_1(t) =
\begin{cases}
(1-c)\alpha & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\
\cdots & \text{otherwise}
\end{cases}
$$

$$
\alpha_2(t) =
\begin{cases}
c\alpha & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\
\cdots & \text{otherwise}
\end{cases}
$$

$$
\beta(t) =
\begin{cases}
\beta & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\
\cdots & \text{otherwise}
\end{cases}
$$

$$
\gamma(t) =
\begin{cases}
\gamma & \text{if } n(t_n+t_p+t_v) < t \le n(t_n+t_p+t_v)+t_n \\
\cdots & \text{otherwise}
\end{cases}
$$

Thus the transition probability matrix $P(t) = [p_{ij}(t)]$ can be solved by the
forward Chapman-Kolmogorov equation [17]:

$$\frac{dP(t)}{dt} = P(t)\,H(t)$$

where $p_{ij}(t)$ is the probability that the system is in state $j$ at time $t$
given that the initial state is $i$. For the state probabilities,
$\pi(t) = [\pi_1(t), \pi_2(t), \ldots, \pi_7(t)]$, we have the differential
equation:

$$\frac{d\pi(t)}{dt} = \pi(t)\,H(t)$$

where $\pi_i(t)$ is the probability that the system is in state $i$ at time $t$
given $\pi(0)$. Because of the absorbing property of states $D_1$ and $D_2$,
$\pi_6(\infty)+\pi_7(\infty) = 1$.
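
The state probabilities can be obtained numerically. The sketch below integrates the equation for the state probabilities with a fixed-step fourth-order Runge-Kutta scheme in plain Python. Several points are assumptions made only for illustration: the rate values outside normal operation (no new errors, detection at rate u from both F and E, no detection from ENF), the entry mu for the transition from E to ENF, all parameter values, and the function names.

```python
def generator(lam, mu, nu, a1, a2, b, g):
    """Rate matrix for states NF, F, FB, E, ENF, D1, D2 (rows sum to zero)."""
    s = mu + nu
    return [
        [-lam, lam * nu / s, lam * mu / s, 0.0, 0.0, 0.0, 0.0],
        [0.0, -(mu + a1 + a2), mu, a1, 0.0, 0.0, a2],
        [0.0, nu, -nu, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, -(mu + b), mu, b, 0.0],
        [0.0, 0.0, 0.0, nu, -(nu + g), g, 0.0],
        [0.0] * 7,
        [0.0] * 7,
    ]


# Hypothetical parameter set, for illustration only.
P = dict(lam=1e-6, mu=0.2, nu=0.2, alpha=0.05, c=0.6, beta=0.2, gamma=0.1,
         u=50.0, t_n=10.0, t_p=0.046, t_v=0.01)


def H_at(t, p=P):
    """Rate matrix at time t of the periodic diagnostic schedule."""
    phase = t % (p["t_n"] + p["t_p"] + p["t_v"])
    if phase < p["t_n"]:   # normal operation
        a1, a2 = (1 - p["c"]) * p["alpha"], p["c"] * p["alpha"]
        b, g = p["beta"], p["gamma"]
    else:                  # diagnostic or swapping: ASSUMED rate values
        a1, a2, b, g = 0.0, p["u"], p["u"], 0.0
    return generator(p["lam"], p["mu"], p["nu"], a1, a2, b, g)


def rhs(pi, H):
    """Right-hand side pi * H for the row vector pi."""
    return [sum(pi[i] * H[i][j] for i in range(7)) for j in range(7)]


def rk4_step(pi, t, h):
    """One classical Runge-Kutta step of size h starting at time t."""
    k1 = rhs(pi, H_at(t))
    k2 = rhs([x + 0.5 * h * k for x, k in zip(pi, k1)], H_at(t + 0.5 * h))
    k3 = rhs([x + 0.5 * h * k for x, k in zip(pi, k2)], H_at(t + 0.5 * h))
    k4 = rhs([x + h * k for x, k in zip(pi, k3)], H_at(t + h))
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(pi, k1, k2, k3, k4)]


def state_probabilities(pi0, t_end, h=0.005):
    """Integrate from time 0 to t_end; h must be small compared with 1/u."""
    pi = list(pi0)
    for k in range(int(round(t_end / h))):
        pi = rk4_step(pi, k * h, h)
    return pi


pi = state_probabilities([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], t_end=20.0)
p_err = pi[3] + pi[4]      # probability of being in E or ENF at t_end
print(pi, p_err)
```

The step size is kept well below 1/u because the diagnostic phase makes the system stiff; a library solver with event handling at the phase boundaries would be a natural replacement for this hand-written integrator.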
4.2. Estimation of the probability of having an unreliable result

The execution of a task consists of parallel and/or serial execution of
processes. We can always partition the task such that every process sends the
computation result to its successors at the end of its execution and receives all
the input data at the beginning of execution. Thus, each process can be
considered as an atomic action [10]. Since an atomic action can be recovered very
easily, we are not interested in the possible faults/errors within the atomic
action if these faults/errors are detected. The more serious situation, namely
the propagation of erroneous information through the system, occurs if the
error cannot be discovered by the end of execution. Let the probability that the
system has at least one error at the output phase be $p_e$. Since the
computation result may or may not be contaminated by the errors, we claim that
$p_e$ is also the probability of having an unreliable result.

Without periodic diagnostic, $p_e$ can be represented easily by the error
latency $t_{el}$ and the process of error occurrence. Let $f_{el}(t)$ and
$f_{er}(t)$ be the probability density functions of $t_{el}$ and the time between
two successive error occurrences induced by different types of fault. Then, the
probability of having an unreliable result, $p_e$, is given by

$$p_e = \int_0^{T_k} f_{er}(t) \int_{T_k - t}^{\infty} f_{el}(\tau)\, d\tau\, dt$$

where $T_k$ is the process execution time. It is obvious that both the reduction
of error latency and the avoidance of error can improve the final figure. Both
density functions can be obtained from the forward Chapman-Kolmogorov equation,
which becomes homogeneous in this case.
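
To make this quantity concrete, the sketch below reads $p_e$ as the probability that an error occurs at some time t within the execution and that its latency exceeds the remaining time T_k - t. It evaluates the resulting integral by the midpoint rule and compares it with the closed form that holds when both densities are exponential. The exponential densities, the rate values, and the function names are assumptions chosen only to keep the example checkable.

```python
import math


def p_unreliable(f_er, surv_el, T_k, n=20000):
    """Midpoint-rule estimate of integral_0^T_k f_er(t) * P(t_el > T_k - t) dt.

    f_er    : density of the time to error occurrence
    surv_el : survival function of the error latency, x -> P(t_el > x)
    """
    h = T_k / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += f_er(t) * surv_el(T_k - t)
    return total * h


# Hypothetical exponential densities: error rate a, detection rate b.
a, b, T_k = 1e-3, 0.16, 100.0
numeric = p_unreliable(lambda t: a * math.exp(-a * t),
                       lambda x: math.exp(-b * x), T_k)
# Closed form of the same integral for exponential densities (a != b).
closed = a * (math.exp(-a * T_k) - math.exp(-b * T_k)) / (b - a)
print(numeric, closed)
```

In this exponential case the result is close to a/b when T_k is long compared with the mean latency, which shows directly how shortening the latency or lowering the error rate reduces $p_e$.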

When a scheduled periodic diagnostic is implemented for the process, the
resultant $p_e$ becomes a function of the time interval between the output
moment and the previous diagnostic. The shorter this time interval, the more
reliable the computation result. Because of the uncertainty of the task
execution time, it is difficult to schedule a periodic diagnostic such that the
system is tested just before the process moves into the output phase. Here,
using the proposed model, we evaluate the maximum $p_e$, denoted by
$\max(p_e)$, at which the time interval between task completion and the last
diagnostic is $t_n$. For a process which sends out its result at $T_k$,
$\max(p_e)$ is the probability that the system is in erroneous states (E or ENF)
at time $T_k$, i.e. $\max(p_e) = \pi_4(T_k) + \pi_5(T_k)$. In fact, because of the
Markovian property in each transition, $\max(p_e)$ is almost independent of the
execution time $T_k$ when $T_k$ is much less than $1/\lambda$. The simulation
results are graphed in Figures 4 and 5. In Figure 4, $\max(p_e)$ starts to
decrease only when each diagnostic has a higher coverage. Note that Tasar
showed that running a diagnostic program for the first 150 $\mu$s can cover
98.46% of the faults of a processor [9]. In Figure 5, we compare four different
cases: 1). without diagnostic, 2). with periodic diagnostics and $c=0.6$, 3). with
periodic diagnostics and $c=0.8$, and 4). with periodic diagnostics and doubling
the detection rate of function level detection mechanisms. It is noted that
$\max(p_e)$ is linearly related to the coverage of signal level detection and
changes exponentially with respect to the detection capability of function level
detection mechanisms.


4.3. Calculation of computation loss 

Given the characteristics of the signal and function level detection
mechanisms which are incorporated in the system, a designer may question how
much time is lost due to faults/errors and how much periodic diagnostics can
improve the reliability and performance. Intuitively, periodic diagnostics can
decrease the probability of having errors and can thus avoid the crash, but it
certainly wastes the useful processing time. The following example is used to
show the variations related to different parameters. If an error is detected after
execution interval $T_d$, the average computation loss due to fault and/or error,
CL, is given by:

$$CL = E(T_d)\,\frac{t_p+t_v}{t_n+t_p+t_v} + \bigl(E(t_{el}) + E(R(t_{el}))\bigr)\,p_d$$

where $p_d$ is the probability of detecting error by function level detection
mechanisms, and $E(t_{el})$ is the mean value of error latency. $E(T_d)$ can be
expressed as $p_d E(T_{d1}) + (1-p_d) E(T_{d2})$ where $T_{d1}$ and $T_{d2}$ are
the amounts of time spent before the system is absorbed into states $D_1$ and
$D_2$, respectively. $T_{d1}$ and $T_{d2}$ are random variables with pdf
$p'_{16}(t)/p_{16}(\infty)$ and $p'_{17}(t)/p_{17}(\infty)$, respectively, where
$'$ denotes the derivative with respect to time. The error latency is also a
random variable with pdf $p'_{46}(t)$. Finally, the percentage of average
computation loss is given by:

$$\frac{CL}{E(T_d)} = \frac{t_p+t_v}{t_n+t_p+t_v} + p_d\,\frac{E(t_{el}) + E(R(t_{el}))}{E(T_d)}$$

The above equation indicates that the time wasted for executing periodic
diagnostics is a dominating factor in the total computation loss when the system
is highly reliable (i.e. the system has a small $\lambda$). However, only the task
currently being executed suffers from the delay due to the error latency and
recovery overhead. This delay in response may cause some serious damage to
the system if execution of the task is time-critical. With $\lambda=10^{-6}$,
$\alpha=0.02$, $\beta=0.2$, $\gamma=0.1$, $u=50$, the simulation results of the
computation loss and the response delay due to error latency versus the period
of a diagnostic cycle are plotted in Figure 6. Once the cost function of response
time for a task and the recovery overhead are given, we can easily calculate the
total loss and then decide the optimal diagnostic schedule, which consists of the
time interval between two successive diagnostics, $t_n$, and the coverage of each
diagnostic, $\xi$.
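
The trade-off between diagnostic overhead and error-related loss can be sketched as below, assuming the percentage of computation loss is the sum of the diagnostic duty cycle (t_p + t_v)/(t_n + t_p + t_v) and the expected latency plus recovery time, weighted by the detection probability and divided by the mean time to detection. The function name and all input values are hypothetical; in practice the expectations would come from solving the Markov model.

```python
def loss_fraction(t_n, t_p, t_v, p_d, mean_latency, mean_recovery, mean_T_d):
    """Fraction of computation lost to diagnostics, error latency and recovery.

    t_n, t_p, t_v  : normal period, diagnostic period, swapping overhead
    p_d            : probability that detection is by function level mechanisms
    mean_latency   : E(t_el), mean error latency
    mean_recovery  : E(R(t_el)), mean recovery overhead
    mean_T_d       : E(T_d), mean execution time before detection
    """
    overhead = (t_p + t_v) / (t_n + t_p + t_v)
    error_loss = p_d * (mean_latency + mean_recovery) / mean_T_d
    return overhead + error_loss


# Illustrative inputs only: a highly reliable system (large mean_T_d),
# where the diagnostic overhead dominates the total.
frac = loss_fraction(t_n=9.0, t_p=0.9, t_v=0.1, p_d=0.4,
                     mean_latency=6.0, mean_recovery=6.0, mean_T_d=1.0e6)
print(frac)
```

Evaluating this function over a range of t_n values, with the expectations recomputed for each schedule, is one simple way to search for a diagnostic period that minimizes the total loss.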
Figure 7 presents the response delay due to error latency for different
combinations of intermittent and permanent faults, where greater error latency
occurs when most faults have a short active time. The improvement in error
latency by diagnostic appears notable only if the cycle time of diagnostic is not
much greater than the fault's active time. However, the computation time is
also wasted in this case. No ideal method so far has been established to
diagnose intermittent faults. Many computers are able to retry instructions
whenever an error is detected. This method is useful to make the system
survive intermittent faults, especially for reading or writing a tape or disk.
Once an error can be detected immediately after its occurrence, the instruction
retry method can also be applied in other parts of the system. This implies that
signal level detection mechanisms should play an important role in
fault-tolerant systems.


5. CONCLUSION 


In this paper, we have presented two performance-related evaluations for
fault-tolerant computers. These two are not usually included in the traditional
reliability models since such models do not deal with the process of error
detection. The first evaluation, the probability of having an unreliable result,
indicates the degree of confidence in the computation result. The suspicion in
the computation result is totally due to the deficiency of the error detection
process. Unfortunately, this deficiency cannot be eliminated completely from any
practical error detection mechanism. In the second evaluation, we take into
account a more detailed computation loss resulting from the occurrence of error,
its detection, and the subsequent recovery. For many cases where a system
requires high overhead for error recovery or suffers from an erroneous output,
the reliability analysis has to quantify these kinds of loss and has to provide a
good method for estimating the total loss.

Though there are several assumptions to be justified through experiments,
the model developed in this paper is general enough to include all aspects from
fault occurrence to error detection and also various detection mechanisms. As
shown in both evaluations, the model has systematically dealt with various
aspects of error detection mechanisms. The results obtained here have high
potential use in decision making during the design or operation phase. The
results also show favorable strategies of periodic diagnostics: 1). for
time-critical tasks, one can derive an optimal diagnostic cycle to minimize the
computation loss and the penalty of delayed response, 2). for noncritical tasks,
the diagnostic is executed only when the system is idle.

Figure 1. The error detection process.

Figure 2. The model for error detection process.

Figure 3. A cycle of periodic diagnostic.

Figure 4. Probability of having an unreliable result versus coverage of

Figure 5. Probability of having an unreliable result versus period of
diagnostic cycle.

( $\lambda=10^{-6}$, $\mu=0.1$, $\nu=0.2$, $\alpha=0.2$, $\beta=0.5$, $\gamma=0.1$, $\xi=0.6$ )

Figure 6. Percentage of total loss (case 1) and average loss due to
error latency (case 2) versus period of diagnostic cycle.

( $\lambda=10^{-6}$, $\mu=\nu=0.2$, $\alpha=0.05$, $\beta=0.2$, $\gamma=0.1$, $u=50.0$, $\xi=0.9$,
$c=0.6$ )

Figure 7. Average loss due to error latency versus period of diagnostic
cycle.

( $\lambda=10^{-6}$, $\alpha=0.05$, $\beta=0.2$, $\gamma=0.1$, $u=50.0$, $\xi=0.9$, $c=0.6$ )